In this tutorial you will step through a voxel-wise modeling analysis. You will use computational models to extract semantic features from a natural speech stimulus. Then these features will be used to build linear models of fMRI data, and model weights and prediction performance will be visualized.
Because we don't assume any background knowledge in Python, you don't need to modify any code to step through the whole assignment. However, you are asked to answer a number of questions throughout the notebook to show that you understand the concepts. If you feel really passionate about this homework and want to explore more, talk to Brian and Yuchen, and we can figure out whether you can earn extra credit on this assignment so that you may skip another one. For example, you could use phoneme features instead of semantic features as the input to the model (have a look at histogram_phonemes2 and phonemes in dsutils.py), or you could use a more powerful model (e.g. a neural network) instead of the linear models to predict brain responses from words. You could even compare the observations from this notebook with the results reported in the Huth 2016 Nature paper (you can find it on Canvas) and write an essay.
When submitting the assignment, make sure that you submit SpeechModel.ipynb and questions.txt with your answers to the questions. Submit two separate files (don't zip them).
In this experiment a subject underwent fMRI scanning while they listened to roughly 2 hours of natural narrative speech stimuli. These stimuli were 10-15 minute complete stories drawn from The Moth Radio Hour, a radio program where storytellers tell true, autobiographical stories in front of a live audience.
This notebook is written in Python, and is presented using the IPython notebook. To evaluate a block of code, click in it to select it, and then press Shift-Enter.
# Run this cell if your computer has a 'retina' or high DPI display. It will make the figures look much nicer.
%config InlineBackend.figure_format = 'retina'
# This cell imports libraries that you will need
# Run this.
from matplotlib.pyplot import figure, cm
import numpy as np
import logging
Here you will load a precomputed vector-space semantic model. This semantic model will transform any word (well, any word it knows about) into a 985-dimensional vector. This 985-dimensional semantic space has the nice property that words that are close together tend to have similar meanings. Although it would have been fun to reconstruct this semantic model in this tutorial, doing so takes a really long time, and the exact parameters don't seem to matter that much. So today you're just going to work with the preconstructed semantic model.
The semantic model was constructed using a fairly large corpus of text (a couple billion words, comprising the stories used as stimuli here, 604 popular books, 2,405,569 Wikipedia pages, and 36,333,459 user comments scraped from reddit.com) and a lexicon of roughly 10,000 words. We selected 985 "basis words" from the Wikipedia "List of 1000 basic words" (contrary to the title, this list does not actually contain 1000 words, but this is where the name of the model comes from). These are common words that span many topics.
We constructed a word co-occurrence matrix, $M$, with 985 rows and 10,470 columns. Iterating through the training corpus, we added 1 to $M_{ij}$ each time word $j$ appeared within 15 words of basis word $i$. The window size of 15 was selected to be large enough to suppress syntactic effects (word order) but no larger. Once the co-occurrence matrix was complete, we log transformed the counts, replacing $M_{ij}$ with $\log(1 + M_{ij})$. Then each row of M was z-scored to correct for differences in basis word frequency, and finally each column of $M$ was z-scored to correct for word frequency. The resulting matrix is the one you're loading here.
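The construction described above can be sketched in a few lines. This is a toy version with made-up inputs, not the actual pipeline used to build English1000 (the real corpus and lexicon are far larger, and `cooccurrence_features` is a hypothetical name):

```python
import numpy as np

def cooccurrence_features(corpus, basis_words, vocab, window=15):
    """Toy co-occurrence construction: count how often each vocabulary
    word appears within `window` words of each basis word, log-transform
    the counts, then z-score rows (basis words) and columns (words)."""
    basis_index = {w: i for i, w in enumerate(basis_words)}
    vocab_index = {w: j for j, w in enumerate(vocab)}
    M = np.zeros((len(basis_words), len(vocab)))
    for pos, word in enumerate(corpus):
        if word not in basis_index:
            continue
        i = basis_index[word]
        lo, hi = max(0, pos - window), min(len(corpus), pos + window + 1)
        for neighbor in corpus[lo:pos] + corpus[pos + 1:hi]:
            if neighbor in vocab_index:
                M[i, vocab_index[neighbor]] += 1
    M = np.log(1 + M)
    # z-score each row (correct for basis word frequency)...
    M = (M - M.mean(1, keepdims=True)) / (M.std(1, keepdims=True) + 1e-8)
    # ...then each column (correct for word frequency)
    M = (M - M.mean(0, keepdims=True)) / (M.std(0, keepdims=True) + 1e-8)
    return M
```

Each column of the resulting matrix is then the semantic vector for one vocabulary word.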
Q1: In your own words, explain how the semantic model transforms words into 985-dimensional vectors. If a vector has a high value on a specific dimension, what does that mean?
# Load semantic model
from SemanticModel import SemanticModel
eng1000 = SemanticModel.load("data/english1000sm.hf5")
# You can get the vector for a word by indexing the model with that word
# For example, the vector for "finger":
print("Vector length:",len(eng1000["finger"]))
print(np.around(eng1000["finger"],decimals=2))
Vector length: 985
[-0.03 -0.16  0.37  0.63 -0.54 -1.47 -1.06 -0.42  0.12 -0.52 ...
 ...  0.47 -0.77  1.56 -0.38]
(full 985-element output truncated for readability)
First let's plot the length 985 vector for one word to see what it looks like.
plot_word = "finger"
f = figure(figsize=(15,5))
ax = f.add_subplot(1,1,1)
ax.plot(eng1000[plot_word], 'k') # 'k' means black
ax.axis("tight")
ax.set_title("English1000 representation for %s" % plot_word)
ax.set_xlabel("Feature number")
ax.set_ylabel("Feature value")
Next let's plot the vectors for three words: "finger", "fingers", and "language". Here you will see that "finger" (in black) and "fingers" (in red) look very similar, but "language" (in blue) looks very different. Neat.
plot_words = ["finger", "fingers", "language"]
colors = ["black", "red", "blue"]
f = figure(figsize=(15,5))
ax = f.add_subplot(1,1,1)
wordlines = []
for word, color in zip(plot_words, colors):
wordlines.append(ax.plot(eng1000[word], color)[0])
ax.axis("tight")
ax.set_title("English1000 representations for some words")
ax.set_xlabel("Feature number")
ax.set_ylabel("Feature value")
ax.legend(wordlines, plot_words)
One nice test of a vector-space semantic model is whether it results in a "semantically smooth" representation of the words. That is, do nearby words in the space have intuitively similar meanings? Here you can test that using the method find_words_like_word.
Give it any word (that the model knows about), and it will print out the 10 closest words (that it knows about) and their cosine similarities (or correlations; the two are the same here because every feature is z-scored). The list includes the word you supplied.
In this next example it prints the closest words to "finger". All 10 closest words are semantically related: 9 are nouns, and 1 is a verb ("stick"). Of course "stick" is also a noun; I'm just assuming that the sense of "stick" that's close to "finger" is probably the verb sense. This brings up an important point: the model does nothing to disambiguate between different word senses!
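Under the hood, find_words_like_word presumably ranks the whole vocabulary by cosine similarity. A minimal sketch of that idea (closest_words is a hypothetical name, and the model is treated here as a plain dict of word vectors, not the actual SemanticModel class):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def closest_words(model, word, n=10):
    """Rank every word in a dict-like model by cosine similarity
    to the given word; the word itself comes back with similarity 1."""
    target = model[word]
    scored = sorted(((cosine_similarity(target, vec), w)
                     for w, vec in model.items()), reverse=True)
    return scored[:n]
```

Because the English1000 vectors are z-scored (zero mean), this cosine similarity coincides with the Pearson correlation between the two vectors.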
# Test semantic model
eng1000.find_words_like_word("finger")
[(1.0, 'finger'), (0.8152045140028774, 'fingers'), (0.6747595355473326, 'hand'), (0.6745201474550794, 'nose'), (0.6618110335849765, 'arm'), (0.6485988426371896, 'mouth'), (0.6477021326212881, 'stick'), (0.6382432365355889, 'neck'), (0.6358168681236411, 'forehead'), (0.6256921294162919, 'tongue')]
Here is another example, this time with an abstract noun, "language". Again the model does a pretty good job of finding related words.
eng1000.find_words_like_word("language")
[(1.0, 'language'), (0.8560080667907979, 'languages'), (0.6499251700967138, 'understanding'), (0.6254849445950689, 'philosophy'), (0.6233664577929342, 'learning'), (0.62227077633301, 'ideas'), (0.6214787276220983, 'philosophical'), (0.6177845454632054, 'expressions'), (0.6156760397365069, 'subject'), (0.6124065941494572, 'interpretation')]
eng1000.find_words_like_vec(eng1000["king"] - eng1000["man"] + eng1000["woman"])
[(0.8110762236714159, 'king'), (0.7535193223645585, 'queen'), (0.744473355043028, 'prince'), (0.6961617552766914, 'elizabeth'), (0.6959339192996907, "king's"), (0.680598722525226, 'maria'), (0.6744722873814397, 'princess'), (0.6737776089077288, 'duke'), (0.6661078636656458, 'sons'), (0.6649050818050299, 'eldest')]
Q2: Choose a new word and see what the model comes up with. Does the output make sense to you?
Next we're going to load up the stimuli. We're not going to be dealing with the actual audio of the stories that were presented, but instead with aligned transcripts. These were generated using the UPenn forced aligner (P2FA), which figures out when each word was spoken given the transcript and the audio. The transcripts are stored in TextGrid format (native to Praat), which can be loaded directly into Python using some code from the natural language toolkit (NLTK).
Here you will load the TextGrids for the stories, as well as 'TRfiles', which specify the time points relative to story onset when the fMRI data was collected (roughly every 2 seconds).
Finally the TextGrids and TRfiles will be combined together into a representation I call a DataSequence. There is nothing interesting going on here scientifically, this is just something to make subsequent steps more manageable.
# These are lists of the stories
# Rstories are the names of the training (or Regression) stories, which we will use to fit our models
Rstories = ['alternateithicatom', 'avatar', 'howtodraw', 'legacy',
'life', 'myfirstdaywiththeyankees', 'naked',
'odetostepfather', 'souls', 'undertheinfluence']
# Pstories are the test (or Prediction) stories (well, story), which we will use to test our models
Pstories = ['wheretheressmoke']
allstories = Rstories + Pstories
# Load TextGrids
from stimulus_utils import load_grids_for_stories
grids = load_grids_for_stories(allstories)
# Load TRfiles
from stimulus_utils import load_generic_trfiles
trfiles = load_generic_trfiles(allstories)
# Make word and phoneme datasequences
from dsutils import make_word_ds, make_phoneme_ds
wordseqs = make_word_ds(grids, trfiles) # dictionary of {storyname : word DataSequence}
phonseqs = make_phoneme_ds(grids, trfiles) # dictionary of {storyname : phoneme DataSequence}
Before going on, let's play with the DataSequences a bit, both so you can see what the data structure looks like, and also so you can see what the stimuli look like.
naked = wordseqs["naked"]
# The DataSequence stores a lot of information
# naked.data is a list of all the words in the story
print ("There are %d words in the story called 'naked'" % len(list(naked.data)))
There are 3218 words in the story called 'naked'
# We can print out the first 100 words like this
print (list(naked.data)[:100])
['i', 'grew', 'up', 'in', 'a', 'really', 'small', 'town', 'in', 'alabama', 'and', 'my', 'sister', 'was', 'and', 'is', 'to', 'this', 'day', 'a', 'remarkably', 'beautiful', 'woman', 'and', 'i', 'was', 'always', 'really', 'good', 'in', 'school', 'so', 'it', 'fell', 'into', 'this', 'pattern', 'with', 'my', 'family', 'where', 'people', 'would', 'say', 'things', 'like', "she's", 'the', 'beauty', "you're", 'the', 'brain', 'and', 'among', 'my', 'almost', 'anorexically', 'petite', 'friends', 'growing', 'up', 'at', 'five', 'foot', 'five', 'a', 'hundred', 'and', 'forty', 'pounds', 'it', 'was', 'universally', 'acknowledged', 'that', 'i', 'was', 'the', 'fat', 'one', 'the', 'whole', 'thing', 'really', 'started', 'in', 'first', 'grade', 'i', 'have', 'to', 'say', 'there', 'was', 'a', 'weigh', 'in', 'in', 'school', 'my']
# or, if you want it to be more readable, like this
print (" ".join(list(naked.data)[:100]))
i grew up in a really small town in alabama and my sister was and is to this day a remarkably beautiful woman and i was always really good in school so it fell into this pattern with my family where people would say things like she's the beauty you're the brain and among my almost anorexically petite friends growing up at five foot five a hundred and forty pounds it was universally acknowledged that i was the fat one the whole thing really started in first grade i have to say there was a weigh in in school my
# the datasequence also stores when exactly each word was spoken (this time corresponds to the middle of each word)
print (naked.data_times[:10])
[0.19705215 0.38662132 0.59614512 0.75675316 0.85153774 1.02517007 1.32947846 1.6537415 1.84829932 2.19251701]
# and it also stores the time of the middle of each fMRI acquisition (each acquisition takes 2.0045 seconds)
# these times are relative to story start, so the fMRI scan started 10 seconds before the story
print (naked.tr_times[:10])
[-9.02085041 -7.04012641 -5.03223841 -3.03203841 -1.02396941 0.98489959 2.98378359 4.99206359 6.99189259 8.99996159]
# and it also makes it easy to, for example, find the words that were spoken during each fMRI acquisition
# (the first few are empty because they came before the story started)
print (naked.chunks()[:10])
[array([], dtype='<U15'), array([], dtype='<U15'), array([], dtype='<U15'), array([], dtype='<U15'), array([], dtype='<U15'), array(['i', 'grew', 'up', 'in', 'a', 'really', 'small', 'town', 'in',
'alabama'], dtype='<U15'), array(['and', 'my', 'sister'], dtype='<U15'), array(['was', 'and', 'is', 'to', 'this', 'day', 'a', 'remarkably'],
dtype='<U15'), array(['beautiful', 'woman', 'and', 'i', 'was', 'always'], dtype='<U15'), array(['really', 'good', 'in', 'school'], dtype='<U15')]
Q3: In your own words, explain how we align fMRI data with words temporally.
The next step in this analysis is that you need to project each word in the stimulus into the English1000 semantic feature space that you loaded above. I wrote a nice function to do this called make_semantic_model that simply takes the word DataSequence and the semantic model, and spits out a new DataSequence where each word is replaced by a 985-dimensional vector.
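The core of make_semantic_model is presumably a simple lookup. Here is a sketch (project_words is a hypothetical name, and zero-filling words the model doesn't know is an assumption about how out-of-vocabulary words are handled):

```python
import numpy as np

def project_words(words, model, ndim=985):
    """Replace each word with its semantic vector, substituting a zero
    vector for words the model does not know about."""
    return np.vstack([model[w] if w in model else np.zeros(ndim)
                      for w in words])
```

Applied to the 3218 words of "naked", this would produce the (3218, 985) matrix shown below.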
# Project stimuli
from dsutils import make_semantic_model
semanticseqs = dict() # dictionary to hold projected stimuli {story name : projected DataSequence}
for story in allstories:
semanticseqs[story] = make_semantic_model(wordseqs[story], eng1000)
# take a look at the projected stimuli
naked_proj = semanticseqs["naked"]
print (naked_proj.data.shape) # prints the shape of 'data' as (rows, columns)
print (naked_proj.data[:5]) # print the first 5 rows (this will be truncated)
(3218, 985) [[-0.48074415 0.10393176 -0.45596143 ... -0.03175243 -0.62248756 -0.32170921] [ 0.09911849 0.06500191 -0.78847992 ... 1.95413789 -0.94043714 -1.0863888 ] [-0.73058513 -0.37099858 -0.59858174 ... -0.30655477 -0.30774183 -0.66815734] [-0.61541241 -1.1005762 -0.33248156 ... 0.23637707 -1.5470721 -0.6521566 ] [-0.73103245 -0.77222486 -0.42595759 ... 0.21817402 -1.33488523 -0.51994927]]
In order to build a model, you need to downsample the semantic representations of the stimuli to the same temporal scale as the fMRI responses that you will be modeling. The DataSequence provides a method that does this, called chunksums.
# Downsample stimuli
interptype = "lanczos" # filter type
window = 3 # number of lobes in Lanczos filter
downsampled_semanticseqs = dict() # dictionary to hold downsampled stimuli
for story in allstories:
downsampled_semanticseqs[story] = semanticseqs[story].chunksums(interptype, window=window)
Doing lanczos interpolation with cutoff=0.499 and 3 lobes.
(the same message is printed once per story, 11 times in total)
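The internals of chunksums aren't shown here, but the printed messages suggest Lanczos-filtered resampling onto the TR times. A generic sketch of that operation (lanczos_kernel and lanczos_resample are illustrative names, and the cutoff value is taken from the printed output above; the real implementation may differ in normalization and edge handling):

```python
import numpy as np

def lanczos_kernel(t, cutoff, window=3):
    """Lanczos kernel with `window` lobes, evaluated at time offsets t
    (seconds), for a given frequency cutoff."""
    t = np.atleast_1d(t) * cutoff
    k = np.sinc(t) * np.sinc(t / window)
    k[np.abs(t) > window] = 0  # kernel has finite support
    return k

def lanczos_resample(data_times, data, tr_times, cutoff=0.499, window=3):
    """Resample irregularly spaced word features onto the fMRI TR times:
    one kernel-weight row per TR, one column per word."""
    W = lanczos_kernel(tr_times[:, None] - data_times[None, :], cutoff, window)
    return W @ data
```

The low-pass cutoff removes temporal frequencies the slow fMRI sampling could not represent anyway, avoiding aliasing.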
Next you're going to visualize what the downsampling did. Here you're going to plot the value of one semantic feature (feature 2, which is actually the third feature: zero-based indexing) for each word, and also the downsampled vector.
# Plot the result
s_words = wordseqs["naked"]
s_sem = semanticseqs["naked"]
s_semdown = downsampled_semanticseqs["naked"]
f = figure(figsize=(15,5))
f.clf()
schan = 2
ax = f.add_subplot(1,1,1)
wordstems = ax.stem(s_sem.data_times,
s_sem.data[:,schan] / np.abs(s_sem.data[:,schan]).max(),
linefmt="k-", markerfmt="k.", basefmt="k-")
interps = ax.plot(s_sem.tr_times,
s_semdown[:,schan] / np.abs(s_semdown[:,schan]).max(), 'r.-')
ax.set_xlim(-6, 60)
ax.set_ylim(-1, 1)
ax.set_xlabel("Time (seconds since story start)")
ax.set_ylabel("Semantic feature value")
ax.legend((wordstems, interps[0]), ("Individual words", "Downsampled feature"));
Q4: Explain the figure you got above. Specify what x axis and y axis represent.
Next you're going to combine all the features from all the stories into one big matrix. Within this operation, you're also going to z-score each feature within each story: subtract its mean, then divide by its standard deviation. This might seem like a weird or incomprehensible thing to do, but I do it because the responses to each story are z-scored individually. Anyway, it's not a big deal.
The features for each story are also trimmed a bit (the variable trim determines how many time points are removed from the beginning and end of each story). The fMRI responses at the beginnings and ends of the stories are often noisier than at other times because of transients and problems with detrending (an fMRI preprocessing step that you don't need to worry about here aside from this point).
The combined features are stored in big matrices called Rstim (with the training, or Regression stimuli) and Pstim (with the test, or Prediction stimuli).
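For reference, npp.zscore presumably does something like the following (a sketch, assuming it operates on each column, i.e. each feature, across time):

```python
import numpy as np

def zscore(mat):
    """Z-score each column of a (time x features) matrix so that every
    feature has zero mean and unit standard deviation."""
    return (mat - mat.mean(0)) / mat.std(0)
```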
# Combine stimuli
from npp import zscore
trim = 5
Rstim = np.vstack([zscore(downsampled_semanticseqs[story][5+trim:-trim]) for story in Rstories])
Pstim = np.vstack([zscore(downsampled_semanticseqs[story][5+trim:-trim]) for story in Pstories])
storylens = [len(downsampled_semanticseqs[story][5+trim:-trim]) for story in Rstories]
print(storylens)
print(np.cumsum(storylens))
[343, 367, 354, 400, 430, 358, 422, 404, 355, 304] [ 343 710 1064 1464 1894 2252 2674 3078 3433 3737]
# Print the sizes of these matrices
print ("Rstim shape: ", Rstim.shape)
print ("Pstim shape: ", Pstim.shape)
Rstim shape: (3737, 985) Pstim shape: (291, 985)
Next you are going to concatenate multiple delayed versions of the stimuli, in order to create a linear finite impulse response (FIR) model. This is a vitally important step, and is conceptually a bit difficult, so take a few minutes to make sure you understand what is going on here.
First you need to understand the problem that the FIR model is solving. fMRI measures the blood-oxygen level dependent (BOLD) signal, which is a complicated and nonlinear combination of blood oxygenation and blood volume. When neurons in an area of the brain become active, they start using up lots of energy. To compensate, nearby blood vessels dilate so that more oxygen and glucose become available to the neurons. The resulting changes in blood oxygenation (which increases) and volume (which also increases) create the magnetic signature that is recorded by fMRI.
But this process is slow. It takes seconds after the neural activity begins for the blood vessels to dilate and for the BOLD response to become apparent. And then it takes more seconds for the response to go away. So although a neural response might only last milliseconds, the associated BOLD response will rise and fall over a span of maybe 10 seconds, orders of magnitude slower. The shape of this rise and fall is called the hemodynamic response function (HRF).
Here is a pretty standard looking example of an HRF:

[Figure: a canonical hemodynamic response function, rising to a peak roughly 5 seconds after onset, then dipping below baseline before returning to zero.]
To accurately model how the brain responds to these stimuli we must also model the HRF. There are many ways to do this. The most common is to assume that the HRF follows a canonical shape. But this approach turns out to not work very well: different parts of the brain have very different vasculature (blood vessels), so the HRF shape can vary a lot.
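A widely used parameterization of the canonical HRF is a difference of two gamma functions. The sketch below is illustrative only; its parameters are SPM-like defaults, not values used anywhere in this notebook:

```python
import math
import numpy as np

def canonical_hrf(t, peak_shape=6.0, under_shape=16.0, ratio=1/6.0):
    """Double-gamma HRF sketch: a gamma bump peaking near
    (peak_shape - 1) seconds minus a smaller, later gamma (the
    undershoot), normalized to a peak value of 1."""
    def gamma_pdf(x, shape):
        x = np.asarray(x, dtype=float)
        out = np.zeros_like(x)
        pos = x > 0
        out[pos] = x[pos] ** (shape - 1) * np.exp(-x[pos]) / math.gamma(shape)
        return out
    h = gamma_pdf(t, peak_shape) - ratio * gamma_pdf(t, under_shape)
    return h / h.max()
```

Plotting canonical_hrf(np.arange(0, 30, 0.1)) reproduces the familiar shape: a rise to a peak around 5 seconds, then a shallow dip below baseline.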
Instead, what you are going to do here is estimate a separate HRF for each semantic feature in each voxel that is being modeled. This estimate will take the form of a linear finite impulse response (FIR) model. The linear FIR form is particularly nice to use because it's very simple to estimate and powerful (if anything, it might be too powerful... more on that later). To build a linear FIR model, all you have to do is concatenate multiple delayed copies of the stimulus. I usually use four delays: 1, 2, 3, and 4 time points. The resulting delayed features can be thought of as representing the stimulus 1, 2, 3, and 4 time points ago. So the regression weights for those features represent how a particular voxel responds to a feature 1, 2, 3, or 4 time points in the past, and together these regression weights form a 4-point estimate of the HRF for that feature in that voxel.
The potential downside of the FIR model is that it may be too expressive. Each feature in each voxel is allowed to have any HRF, but this comes at the cost of multiplying the total number of regression weights that we must fit by the number of delays. In all likelihood the true HRFs vary, but they don't vary that much, so we probably don't need this many independent features. This cost becomes apparent if you increase the number of delays. This will slow down model fitting and likely decrease the stability of the regression weights, leading to decreased model performance.
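The delaying step is simple enough to sketch directly (this is a plausible re-implementation, not necessarily identical to util.make_delayed; zero-padding the start of each shifted copy is an assumption):

```python
import numpy as np

def make_delayed_sketch(stim, delays):
    """Horizontally stack copies of `stim` shifted forward in time by
    each delay, padding the start of each copy with zeros. A (T, F)
    matrix becomes (T, F * len(delays))."""
    T, F = stim.shape
    out = np.zeros((T, F * len(delays)))
    for k, d in enumerate(delays):
        out[d:, k * F:(k + 1) * F] = stim[:T - d]
    return out
```

With 985 features and delays (1, 2, 3, 4), this yields the 985 * 4 = 3940 columns you will see printed below.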
# Delay stimuli
from util import make_delayed
ndelays = 4
delays = range(1, ndelays+1)
print ("FIR model delays: ", delays)
delRstim = make_delayed(Rstim, delays)
delPstim = make_delayed(Pstim, delays)
FIR model delays: range(1, 5)
# Print the sizes of these matrices
print ("delRstim shape: ", delRstim.shape)
print ("delPstim shape: ", delPstim.shape)
delRstim shape: (3737, 3940) delPstim shape: (291, 3940)
Q5: In your own words, explain what problem is addressed by FIR model and how the model addresses it.
Next you will load the fMRI data. This is totally the most exciting part! These responses have already been preprocessed (the 3D images were motion corrected and aligned to each other, detrended, and then z-scored within each stimulus) so you don't have to worry about that.
You will load three different variables: zRresp, the responses to the regression dataset; zPresp, the responses to the prediction dataset; and mask, a 3D mask showing which voxels have been selected (we are not modeling every voxel in the scan, which would take forever; we model only the voxels that overlap the cerebral cortex).
# Load responses
import tables
resptf = tables.open_file("data/fmri-responses.hf5")
zRresp = resptf.root.zRresp.read()
zPresp = resptf.root.zPresp.read()
mask = resptf.root.mask.read()
# Print matrix shapes
print ("zRresp shape (num time points, num voxels): ", zRresp.shape)
print ("zPresp shape (num time points, num voxels): ", zPresp.shape)
print ("mask shape (Z, Y, X): ", mask.shape)
zRresp shape (num time points, num voxels): (3737, 37226) zPresp shape (num time points, num voxels): (291, 37226) mask shape (Z, Y, X): (31, 100, 100)
Next you will visualize where the voxels are coming from in the brain. This will give you an idea of where the data come from.
First you will plot a single slice through the mask in the Z dimension. This is called an "axial" slice. The top of the image is the front of the brain, the bottom is the back. The left side of the image is the right side of the brain, and the right side of the image is the left side of the brain (as if you are looking up at the brain from under the subject's chin; this left-right reversal is often referred to as "radiological coordinates", as opposed to "neurological coordinates" where you are looking down from the top).
Then you will plot a mosaic of all the slices. This is done using the function mosaic from James Gao's pyCortex package.
# Plot one slice of the mask that was used to select the voxels
f = figure()
ax = f.add_subplot(1,1,1)
ax.matshow(mask[16], interpolation="nearest", cmap=cm.gray) # show the 17th slice of the mask
# Plot mask mosaic
import cortex
f = figure(figsize=(10,10))
cortex.mosaic(mask, cmap=cm.gray, interpolation="nearest");
Next you will visualize the responses of a few selected voxels over time. I selected these particular voxels because they are reasonably well explained by the semantic model, but have some differences in their responses across time.
# Plot the response of a few voxels over time
selvoxels = [20710, 27627, 24344, 34808, 22423, 25397]
f = figure(figsize=(15, 5))
ax = f.add_subplot(1,1,1)
for ii,vi in enumerate(selvoxels):
ax.plot(zRresp[:500, vi] - 5 * ii)
ax.set_xlim(0, 500)
ax.set_yticks([])
ax.set_xticks(range(0, 500, 50))
ax.set_xlabel("Time (fMRI volumes)")
ax.set_ylabel("Voxel responses")
ax.grid()
Finally, the core of the analysis: you will fit a regression model that predicts the responses of each voxel as a weighted sum of the semantic features. This model will then be tested using a held out dataset (the Prediction dataset). And if the model proves to be reasonably predictive, then the weights of the regression model will tell us what semantic features each voxel responds to.
This is a linear regression model, so if the response time course for voxel $j$ is $R_j$, the stimulus time course for semantic feature $i$ is $S_i$, and the regression weight for feature $i$ in voxel $j$ is $\beta_{ij}$, then the model can be written as:
$$\hat{R}_j = \beta_{0j} S_0 + \beta_{1j} S_1 + \cdots$$

or:

$$\hat{R}_j = \sum_i \beta_{ij} S_i$$

The trick, of course, is accurately estimating the $\beta_j$ values. This is commonly done by minimizing the sum of squared errors (here across time, $t$):
$$E_j(\beta) = \sum_t (R_{jt} - \hat{R}_{jt})^2 = \sum_t (R_{jt} - \sum_i \beta_{ij} S_{it})^2$$

$$\beta_j = \underset{\beta}{\operatorname{argmin}} E_j(\beta)$$

Computing $\beta$ this way is called ordinary least squares (OLS), and it will not work in our case because the total number of features (3940) is larger than the number of time points (3737), so the system is underdetermined. (OLS would be possible if the number of delays were smaller than 4, but it would give terrible results... feel free to try it! OLS can be performed using the function np.linalg.lstsq.)
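For intuition, here is a toy OLS fit with np.linalg.lstsq in the regime where it does work (more time points than features). The stimuli, weights, and noise level are all synthetic:

```python
import numpy as np

# Toy OLS: with 200 time points and only 5 features, least squares
# recovers the true weights up to the noise level.
rng = np.random.default_rng(0)
S = rng.standard_normal((200, 5))                    # stimulus features
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])     # known weights
R = S @ beta_true + 0.01 * rng.standard_normal(200)  # noisy responses
beta_ols, residuals, rank, sv = np.linalg.lstsq(S, R, rcond=None)
```

In our actual problem the roles are reversed (3940 features vs. 3737 time points), so infinitely many weight vectors fit the training data perfectly, and OLS has no way to pick a sensible one. That is exactly the gap regularization fills.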
In almost every case, linear regression can be improved by making some prior assumptions about the weights (or, equivalently, about the covariance structure of the stimuli). This is called regularization, or regularized linear regression. One way to do this is to penalize the error function by the sum of the squared weights. This is commonly known as ridge regression, and is a special case of Tikhonov regularization. It finds the $\beta$ that minimizes the following error function:
$$E_j(\beta) = \sum_t (R_{jt} - \sum_i \beta_{ij} S_{it})^2 + \alpha \sum_i \beta_{ij}^2$$

(In practice we will use a different formulation that involves re-weighting the singular values of the matrix $S$ before computing its pseudoinverse. This method achieves the same result, but is extremely efficient because it uses all the linear algebra machinery that computers are so good at to build many models in parallel.)
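A minimal sketch of the singular-value re-weighting idea (this relies on the standard ridge/SVD identity; it is not the notebook's actual ridge code, and ridge_weights is a hypothetical name):

```python
import numpy as np

def ridge_weights(S, R, alpha):
    """Ridge regression via the SVD S = U diag(d) V^T: re-weight each
    singular value d as d / (d**2 + alpha) instead of inverting
    (S^T S + alpha I) directly. Gives the same betas, but one SVD can
    be reused for many alphas and many voxels at once."""
    U, d, Vt = np.linalg.svd(S, full_matrices=False)
    shrink = d / (d ** 2 + alpha)       # ridge-re-weighted singular values
    return Vt.T @ (shrink[:, None] * (U.T @ R))
```

Since R can hold one column per voxel, a single matrix product here fits tens of thousands of voxel models simultaneously, which is why this formulation is so much faster than solving each voxel separately.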
You may have noticed in the equation above that we have introduced a new parameter, $\alpha$, which controls the strength of the regularization. If $\alpha$ is set to zero, then we get back to exactly the OLS formulation (above). As $\alpha$ goes to infinity, the regularization forces all the weights to go to zero (in practice this also has the slightly weirder effect of making all the weights independent, as if each feature was regressed separately on the responses).
So how do we choose $\alpha$? We're going to do it here using cross-validation. First, we split the Regression dataset up into two parts. Then we estimate the weights for a given $\alpha$ on the first part, and test how well we can predict responses on the second part. This is repeated for each possible $\alpha$ that we want to test, and for a couple different splits of the Regression dataset. Then we find the $\alpha^*$ that gave us the best predictions within the split Regression dataset. Finally we estimate the weights using the entire Regression dataset and the selected $\alpha^*$.
Because this is an annoying and laborious process, I've encapsulated it within the function bootstrap_ridge. You simply give this function your datasets, the possible $\alpha$ values, and a few parameters for the cross-validation, and it does all the rest. The parameter nboots determines the number of cross-validation tests that will be run.
To do cross-validation, bootstrap_ridge divides the Regression dataset into many small chunks, and then splits those chunks into the two groups that will be used to estimate weights and test $\alpha$ values. This is better than just choosing individual time points because both the fMRI data and stimuli are autocorrelated (i.e. correlated across time). The parameter chunklen determines the length of the chunks, and the parameter nchunks determines the number of chunks in the $\alpha$-testing dataset. By default I set chunklen to 40 time points (80-second chunks), and set nchunks to 20 (40 * 20 = 800 time points for testing $\alpha$ values, 3737-800 = 2937 time points for estimating weights). These values should not matter too much.
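The chunked splitting idea can be sketched as follows. This is a toy illustration of the principle, not bootstrap_ridge's actual code; chunked_split is a hypothetical helper written just for this sketch:

```python
import numpy as np

def chunked_split(n_timepoints, chunklen, nchunks, rng):
    """Split time indices into a held-out set of contiguous chunks and a
    training set, so autocorrelated neighbors stay in the same group."""
    n_total_chunks = n_timepoints // chunklen
    held = rng.choice(np.arange(n_total_chunks), size=nchunks, replace=False)
    test_idx = np.concatenate([np.arange(c * chunklen, (c + 1) * chunklen)
                               for c in held])
    train_idx = np.setdiff1d(np.arange(n_timepoints), test_idx)
    return train_idx, test_idx

rng = np.random.default_rng(0)
train_idx, test_idx = chunked_split(3737, chunklen=40, nchunks=20, rng=rng)
print(len(test_idx), len(train_idx))  # 800 2937, matching the numbers above
```

Sampling whole 40-point chunks, rather than 800 scattered time points, keeps temporally adjacent (and therefore correlated) samples from leaking between the weight-estimation and $\alpha$-testing sets.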
Running the regression will take a few minutes.
Q6: We are predicting something from something. What are these? In other words, what are the inputs and outputs of the regression model? Be as specific as possible.
# Run regression
from ridge import bootstrap_ridge

alphas = np.logspace(1, 3, 10) # Equally log-spaced alphas between 10 and 1000; the third argument is the number of alphas to test
nboots = 1 # Number of cross-validation runs
chunklen = 40 # Length of each chunk, in time points
nchunks = 20 # Number of chunks held out for testing alphas

wt, corr, alphas, bscorrs, valinds = bootstrap_ridge(delRstim, zRresp, delPstim, zPresp,
                                                     alphas, nboots, chunklen, nchunks,
                                                     singcutoff=1e-10, single_alpha=True)
# Plot mean cross-validation performance as a function of alpha
f = figure()
ax = f.add_subplot(1,1,1)
ax.semilogx(np.logspace(1, 3, 10), bscorrs.mean(2).mean(1), 'o-')
ax.set_xlabel("Alpha")
ax.set_ylabel("Mean cross-validation correlation");
Next let's have a look at the variables returned by the regression function.
# wt is the regression weights
print ("wt has shape: ", wt.shape)
# corr is the correlation between predicted and actual voxel responses in the Prediction dataset
print ("corr has shape: ", corr.shape)
# alphas is the selected alpha value for each voxel, here it should be the same across voxels
print ("alphas has shape: ", alphas.shape)
# bscorrs is the correlation between predicted and actual voxel responses for each round of cross-validation
# within the Regression dataset
print ("bscorrs has shape (num alphas, num voxels, nboots): ", bscorrs.shape)
# valinds is the indices of the time points in the Regression dataset that were used for each
# round of cross-validation
print ("valinds has shape: ", np.array(valinds).shape)
wt has shape:  (3940, 37226)
corr has shape:  (37226,)
alphas has shape:  (37226,)
bscorrs has shape (num alphas, num voxels, nboots):  (10, 37226, 1)
valinds has shape:  (1, 800)
The bootstrap_ridge function already computed predictions and correlations for the Prediction dataset, but this is important so let's reproduce that step more explicitly.
Remember that according to the linear model, the predicted responses for each voxel are a weighted sum of the semantic features. An easy way to compute that is by taking the dot product between the weights and semantic features: $$\hat{R} = S \beta$$
# Predict responses in the Prediction dataset
# First let's refresh ourselves on the shapes of these matrices
print ("zPresp has shape: ", zPresp.shape)
print ("wt has shape: ", wt.shape)
print ("delPstim has shape: ", delPstim.shape)
zPresp has shape:  (291, 37226)
wt has shape:  (3940, 37226)
delPstim has shape:  (291, 3940)
# Then let's predict responses by taking the dot product of the weights and stim
pred = np.dot(delPstim, wt)
print ("pred has shape: ", pred.shape)
pred has shape: (291, 37226)
Next let's plot some predicted and actual responses side by side.
f = figure(figsize=(15,5))
ax = f.add_subplot(1,1,1)
selvox = 20710 # a decent voxel
realresp = ax.plot(zPresp[:,selvox], 'k')[0]
predresp = ax.plot(pred[:,selvox], 'r')[0]
ax.set_xlim(0, 291)
ax.set_xlabel("Time (fMRI time points)")
ax.legend((realresp, predresp), ("Actual response", "Predicted response"));
You might notice above that the predicted and actual responses look pretty different scale-wise, although the patterns of ups and downs are vaguely similar. But we don't really care about the scale -- for fMRI it's relatively arbitrary anyway, so let's rescale them both to have unit standard deviation and re-plot.
f = figure(figsize=(15,5))
ax = f.add_subplot(1,1,1)
selvox = 20710 # a good voxel
realresp = ax.plot(zPresp[:,selvox], 'k')[0]
predresp = ax.plot(zscore(pred[:,selvox]), 'r')[0]
ax.set_xlim(0, 291)
ax.set_xlabel("Time (fMRI time points)")
ax.legend((realresp, predresp), ("Actual response", "Predicted response (scaled)"));
Now you see that the actual and scaled predicted responses look very similar. We can quantify this similarity by computing the correlation between the two (correlation is scale-free, so it automatically does the re-scaling that we did here). This voxel has a high correlation.
# Compute correlation between single predicted and actual response
# (np.corrcoef returns a correlation matrix; pull out the element [0,1] to get
# correlation between the two vectors)
voxcorr = np.corrcoef(zPresp[:,selvox], pred[:,selvox])[0,1]
print ("Correlation between predicted and actual responses for voxel %d: %f" % (selvox, voxcorr))
Correlation between predicted and actual responses for voxel 20710: 0.593783
Next let's compute this correlation for every voxel in the dataset. There are some very efficient ways to do this, but here I've written a for loop so that it's very explicit what's happening. (This should give exactly the same values as the variable corr, which was returned by bootstrap_ridge.)
voxcorrs = np.zeros((zPresp.shape[1],)) # create zero-filled array to hold correlations
for vi in range(zPresp.shape[1]):
    voxcorrs[vi] = np.corrcoef(zPresp[:,vi], pred[:,vi])[0,1]
print(voxcorrs) # all correlations
print(voxcorrs[20710]) # correlation for voxel 20710
[ 0.05682109 -0.02634303 -0.03431781 ... -0.08442223  0.08456757 -0.04400821]
0.5937831372644157
Let's start with a supposition: the correlation should not be high everywhere, even if this is a good model of how the brain represents the semantic content of speech. There are parts of the brain that just don't respond to speech, so the correlation should be low in those areas. There are other parts of the brain that respond to speech, but maybe don't represent semantic information, so the correlation should be low in those areas as well.
But let's begin by plotting a histogram of the correlations across the entire brain. This will show generally whether the model is working well or not.
# Plot histogram of correlations
f = figure(figsize=(8,8))
ax = f.add_subplot(1,1,1)
ax.hist(voxcorrs, 100) # histogram correlations with 100 bins
ax.set_xlabel("Correlation")
ax.set_ylabel("Num. voxels");
If the semantic features didn't capture anything about brain activity, then we would expect the histogram to be symmetric and centered around zero. But here we see that it's highly skewed, with lots of positive values. This looks good! This model is working!
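One way to convince yourself of this logic is a toy null comparison: if we destroy the temporal alignment between predictions and responses (for example by circularly shifting the predictions in time), the correlations should collapse to a symmetric distribution around zero. This sketch uses simulated data, not the actual zPresp and pred matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
T, V = 291, 2000                                   # toy sizes loosely mimicking the Prediction dataset
signal = rng.standard_normal((T, V))               # per-voxel "true" signal
resp = signal + rng.standard_normal((T, V))        # noisy responses
pred = signal + 0.5 * rng.standard_normal((T, V))  # predictions that track the signal

def voxelwise_corr(a, b):
    # correlation of each column of a with the matching column of b
    az = (a - a.mean(0)) / a.std(0)
    bz = (b - b.mean(0)) / b.std(0)
    return (az * bz).mean(0)

real_corrs = voxelwise_corr(resp, pred)
# Null: circularly shift the predictions in time, destroying the alignment
null_corrs = voxelwise_corr(resp, np.roll(pred, T // 2, axis=0))
print(real_corrs.mean())   # clearly positive (around 0.63 here)
print(null_corrs.mean())   # near zero: no alignment, no correlation
```

A skewed, mostly-positive histogram like the one above is therefore evidence that the model is capturing real temporal structure, not an artifact of the correlation measure itself.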
Next, let's plot a mosaic of the correlations across the brain, as we plotted the mask earlier.
# Plot mosaic of correlations
corrvolume = np.zeros(mask.shape)
corrvolume[mask>0] = voxcorrs
f = figure(figsize=(10,10))
cortex.mosaic(corrvolume, vmin=0, vmax=0.5, cmap=cm.hot);
In the mosaic we can see that there seem to be some concentrated areas of high correlation. But it's hard to say where in the brain those areas are based on the mosaic. So next you're going to create a fancy 3D visualization of the correlations using pyCortex.
Once you've opened the viewer you'll be presented with a 3D view of the brain with colors showing the correlations. White outlines and labels show the locations of known brain areas (motor, somatosensory, visual, and some language areas). Drag around with your left mouse button to rotate the view, and the right mouse button to zoom in or out.
By default you'll see a view of the cortex as it looks in reality: folded and convoluted. To better see parts of the brain that are hidden down in the folds, you can press "i" to see an inflated view (or drag the "Mix" slider at the bottom of the screen to the middle). This helps to see the data, but you will still need to rotate the brain to see all of it. To make the entire cortex visible at once, you can press "f" to see a flattened view. To create this view we cut the cortical surface at a few locations, and then flattened it out so that it can all be seen at once (but this introduces some distortions).
# Plot correlations on cortex
import cortex
corrvol = cortex.Volume(corr, "S1", "fullhead", mask=mask, vmin=0, vmax=0.5, cmap='hot')
cortex.webshow(corrvol, port=8889, open_browser=False)
Started server on port 8889
<WebApp(Thread-6, started 12539539456)>
pyCortex also offers a simpler way to view the correlations. This method only shows the flat view, but can be embedded right here in the ipython notebook. This should look like the flat view in the 3D viewer.
# Plot correlation flatmap
cortex.quickshow(corrvol, with_rois=False, with_labels=False);
Q7: Based on the analysis you conducted, which parts of the brain respond to natural narrative speech stimuli?
Now that we have a working model, let's try to figure out what semantic features are making each voxel respond. One way to do this is to simulate how the voxel will respond to individual words, and then find the most preferred words for that voxel.
But first we have an issue to contend with: we have separate weights for each delay. We could look at the weights for each delay, but instead here you will average the weights across delays to get a single set of weights for the voxel.
# Undelay voxel weights (average across delays)
import operator
from functools import reduce
udwt = reduce(operator.add, np.split(wt/ndelays, ndelays))
udwt.shape
(985, 37226)
Next you will pick which voxel to visualize. Since many voxels are modeled poorly, we will pick from among the best modeled voxels.
# Sort voxels by correlation so that we can pick a good voxel
# This will sort voxels in decreasing order of correlation
corrsort = np.argsort(corr)[::-1]
# Define function that will print best words for a voxel
import pprint

def print_voxel_words(voxnum):
    # find_words_like_vec returns the 10 words most correlated with the given vector, and the correlations
    voxwords = eng1000.find_words_like_vec(udwt[:,voxnum])
    print ("Best words for voxel %d (correlation %0.3f):" % (voxnum, voxcorrs[voxnum]))
    pprint.pprint(voxwords)
# Print best words for some voxels
print_voxel_words(corrsort[0]) # best voxel
print_voxel_words(corrsort[14]) # 15th best voxel
Best words for voxel 20710 (correlation 0.594):
[(0.42623273274355367, 'sheet'),
 (0.4225392452781918, 'edges'),
 (0.411712302265805, 'diameter'),
 (0.4114919384027769, 'strips'),
 (0.4100482567572256, 'cardboard'),
 (0.4099238127158719, 'copper'),
 (0.4013363008711215, 'steel'),
 (0.4008018664965085, 'colored'),
 (0.4007191862573021, 'coloured'),
 (0.4004989005118823, 'leaf')]
Best words for voxel 21577 (correlation 0.489):
[(0.3800741341170798, 'innocent'),
 (0.36910697273755894, 'victim'),
 (0.3616865775791469, 'murderer'),
 (0.35691275195478456, 'child'),
 (0.3554678197597673, 'murder'),
 (0.34906297905556105, 'life'),
 (0.349029002056761, 'truth'),
 (0.34566504729708847, 'guilty'),
 (0.34416283878664683, 'indeed'),
 (0.34254418071994436, 'death')]
Q8: Do you see similarities among words that are preferred by the same voxel? What does that imply?